Modeling and Fine-Tuning

Owner: Daniel Soukup - Created: 2025.11.01

In this notebook, we load the processed data and fit our models.

Data loading

Let's load our processed data and create feature/target dataframes for both train and test.
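The loading step can be sketched as below. The target column name and file paths are assumptions; adjust them to the processed dataset's actual layout.

```python
import pandas as pd

TARGET = "income_class"  # hypothetical target column name

def split_features_target(df: pd.DataFrame, target: str = TARGET):
    """Separate a processed dataframe into a feature matrix X and target y."""
    X = df.drop(columns=[target])
    y = df[target]
    return X, y

# The processed files' paths and format are assumptions (CSV, parquet, ...):
# train_df = pd.read_csv("data/processed/train.csv")
# X_train, y_train = split_features_target(train_df)
```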

We notice that some special characters can cause issues in training; we address this here.
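One common instance of this: XGBoost rejects feature names containing `[`, `]`, or `<`. A minimal sketch of a column sanitizer, assuming the issue is in the feature names (the replacement character is a choice, not from the source):

```python
import re
import pandas as pd

def sanitize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Replace characters XGBoost rejects in feature names with '_'."""
    out = df.copy()
    out.columns = [re.sub(r"[\[\]<>]", "_", str(c)) for c in out.columns]
    return out
```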

Recall that 8% of the processed samples fall into target class 1 (high income), so a dummy classifier that always predicts 0 would be 92% accurate.
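This baseline is easy to verify with sklearn's `DummyClassifier` on synthetic labels with the same class balance (the data here is illustrative, not the actual dataset):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Illustrative labels with ~8% positives, mirroring the class balance above.
rng = np.random.default_rng(0)
y = (rng.random(1_000) < 0.08).astype(int)
X = np.zeros((len(y), 1))  # features are irrelevant to a constant predictor

dummy = DummyClassifier(strategy="constant", constant=0).fit(X, y)
accuracy = dummy.score(X, y)  # equals the majority-class share, ~0.92
```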

Important Note: We won't use the test set for any optimization, to avoid overfitting; we reserve it for the final evaluation of the optimized model only, as an unbiased estimate of performance on completely unseen data.

Modeling

Our current approach focuses on optimizing an XGBoost binary classifier, using Optuna to search the hyperparameter space efficiently. We also aim to address the class imbalance during training by:

Fit Baseline

As mentioned, we'll be using sample weights to adjust for the class imbalance:
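A minimal sketch of the weighting scheme, assuming positives are simply up-weighted by a constant multiplier (the function name is ours):

```python
import numpy as np

def make_sample_weights(y, multiplier: float) -> np.ndarray:
    """Up-weight positive-class samples by `multiplier`; negatives stay at 1."""
    return np.where(np.asarray(y) == 1, multiplier, 1.0)
```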

Next, we set up our cross-validation helper:

Let's test our function:

Let's try a large multiplier:

We can see that the multiplier has a massive effect on the AUCPR score.

Optimize Hyperparameters - Main Run

Next, we'll look to optimize the model hyperparameters. As we do this, our experiments will be tracked using MLflow.

The function below defines the HP space to explore (parameters and their ranges), focusing on 5 parameters known to have a strong effect on model performance and regularization:

Finally, we are ready to run our study, currently consisting of 40 trials:

Let's see the best results:

Tuning Analysis

Let's see how the HP choices impacted performance:

We will look at different projections of the HP space and the best observed values:

The best-performing models were found in the mid-to-upper range of boosting rounds with lower max depth (the latter helps avoid overfitting when the number of estimators is high).

In our experiments, high scores also corresponded with smaller column samples (the fraction of columns each estimator uses), unless max depth was significantly lowered. A small column sample again helps avoid overfitting, although the pattern here is less clear.

While the patterns here are not the clearest, we can see that combining high boosting rounds with high sampling rates leads to lower scores (the bottom-right corner, likely overfitting again).

Given that some of the best results were observed at the end of the specified search range, it would be a good next step to extend the range further, potentially with a larger step size for boosting rounds.

Finally, we look at the multiplier effect:

On average, the higher the multiplier, the better the AUCPR score we got, which is also shown in the heatmaps below. We appear to get the most benefit from weights above roughly 40.

This pattern is nicely visible in the heatmaps above and below as well.

Predict

We save both the predicted classes and the predicted probabilities:
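A sketch of the prediction step, assuming a fitted classifier with a sklearn-style `predict_proba` (such as the tuned XGBoost model) and a default 0.5 decision threshold:

```python
import numpy as np

def predict_with_proba(model, X, threshold: float = 0.5):
    """Return (predicted class, positive-class probability) for each row."""
    proba = model.predict_proba(X)[:, 1]
    return (proba >= threshold).astype(int), proba
```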

Interpretation

Finally, let's look at our model's feature importances (top 20):
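This can be done along the following lines, using the fitted model's `feature_importances_` attribute (gain-based by default in XGBoost's sklearn wrapper); the helper name is ours:

```python
import pandas as pd

def top_importances(model, feature_names, k: int = 20) -> pd.Series:
    """Top-k features ranked by the model's feature_importances_."""
    imp = pd.Series(model.feature_importances_, index=list(feature_names))
    return imp.sort_values(ascending=False).head(k)
```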

Observations:

All these findings align with our expectations and EDA. Our model picked up on the gender bias in our data (there are many more high-earning males than females in the dataset), which can definitely be addressed in future model iterations; please see the slides for more info.

79% of high-income earners were male, compared to 46% of low-income earners. This statistical disparity is a strong signal for the model to pick up on and use for classification.

Save predictions

We finally save the results to their own datasets which can be used for evaluation:
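The saving step can be sketched as below. The column names, CSV format, and output path are assumptions; any tabular format the evaluation notebook can read back would do:

```python
import pandas as pd

def save_predictions(y_true, y_pred, y_proba, path: str) -> pd.DataFrame:
    """Bundle labels, predicted classes, and probabilities, then persist
    them to `path` for the downstream evaluation step."""
    out = pd.DataFrame({"y_true": y_true, "y_pred": y_pred, "y_proba": y_proba})
    out.to_csv(path, index=False)
    return out
```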